{"nbformat":4,"nbformat_minor":0,"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.5.2"},"colab":{"name":"MultiRC-NER.ipynb","provenance":[],"collapsed_sections":[],"toc_visible":true,"machine_shape":"hm"},"accelerator":"GPU"},"cells":[{"cell_type":"markdown","metadata":{"id":"PX4QQ3CFW1TF","colab_type":"text"},"source":["# **MultiRC** - Multihop multiple-choice question answering dataset\n","\n","# **Model** - NER-based QA\n","\n","**APPROACH** -\n","\n","**Dataset Preparation**\n","1. Concatenate paragraph + question + answers into a single context\n","2. Use a distinguishing tag for each segment: paragraph (P), question (Q), correct answer (C) and wrong answer (W), plus an inside tag (I)\n","3. The dataset is then a CSV file with the following structure:\n","\n","\\<ID, TOKEN, TAG\\>\n","\n","where,\n","\n","ID: unique for every (paragraph, question, answers) combination\n","\n","TOKEN: paragraph + question + options, concatenated and tokenized\n","\n","TAG: predetermined tag for every token in the context\n","\n","\n","**Model Preparation**\n","\n","4. Train the model to learn this variation of BIO tagging\n","\n","**Evaluation Preparation**\n","\n","5. Evaluate the model's performance against the expected results: the correct answer should be tagged C followed by I tags, and a wrong answer W followed by I tags."]},{"cell_type":"markdown","metadata":{"id":"ulTtzBt1xM9G","colab_type":"text"},"source":["# NOTE: Search for \"TODO\" to switch between the original and sampled data"]},{"cell_type":"code","metadata":{"id":"9ddbuTPIUi7h","colab_type":"code","colab":{}},"source":["%reload_ext autoreload\n","%autoreload 2\n","%matplotlib inline"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"jwPQNGAnWbHc","colab_type":"text"},"source":["# Mounting data\n","1. 
train.csv - training set\n","2. dev.csv - validation set, used here for testing\n","\n","Note: We use the validation set as our test set because the MultiRC test set is not publicly available, so its labels cannot be checked and model performance cannot be analysed on it."]},{"cell_type":"code","metadata":{"id":"oVMX9PY6U-cc","colab_type":"code","outputId":"2a222bbf-9ee5-494d-b157-2b8000486287","executionInfo":{"status":"ok","timestamp":1588622687318,"user_tz":420,"elapsed":1798,"user":{"displayName":"Soujanya Ranganatha Bhat","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjYZYrqnugKHbgoo144GZ9rzmvTfTGIL9eFkBCz=s64","userId":"15617339232293464832"}},"colab":{"base_uri":"https://localhost:8080/","height":34}},"source":["from google.colab import drive\n","drive.mount('/content/gdrive')"],"execution_count":2,"outputs":[{"output_type":"stream","text":["Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount(\"/content/gdrive\", force_remount=True).\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"id":"hB8x1XAPhB2D","colab_type":"code","colab":{}},"source":["PARENT_DIR = \"/content/gdrive/My Drive/MultiRC_NER\""],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"UutbeUmMVP_P","colab_type":"code","outputId":"09effe54-527a-4795-a4c6-b5f9a6807ea8","executionInfo":{"status":"ok","timestamp":1588622689663,"user_tz":420,"elapsed":4122,"user":{"displayName":"Soujanya Ranganatha Bhat","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjYZYrqnugKHbgoo144GZ9rzmvTfTGIL9eFkBCz=s64","userId":"15617339232293464832"}},"colab":{"base_uri":"https://localhost:8080/","height":68}},"source":["!ls \"/content/gdrive/My Drive/MultiRC_NER/data\""],"execution_count":4,"outputs":[{"output_type":"stream","text":["dev.csv\t\tdev_v3.csv  parsing_v5.py  train_sample.csv  train_v4.csv\n","dev_sample.csv\tdev_v4.csv  qa\t\t   train_v2.csv      train_v5.csv\n","dev_v2.csv\tdev_v5.csv  train.csv\t   train_v3.csv      
vocab.txt\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"WaD3PslCWEVK","colab_type":"text"},"source":["# Requirements"]},{"cell_type":"code","metadata":{"id":"WyQ9p51QWDHq","colab_type":"code","outputId":"5e8d57a0-fe24-4fa0-e993-6ae0d5de80c9","executionInfo":{"status":"ok","timestamp":1588622696921,"user_tz":420,"elapsed":11367,"user":{"displayName":"Soujanya Ranganatha Bhat","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjYZYrqnugKHbgoo144GZ9rzmvTfTGIL9eFkBCz=s64","userId":"15617339232293464832"}},"colab":{"base_uri":"https://localhost:8080/","height":561}},"source":["!pip install seqeval\n","!pip install transformers"],"execution_count":5,"outputs":[{"output_type":"stream","text":["Requirement already satisfied: seqeval in /usr/local/lib/python3.6/dist-packages (0.0.12)\n","Requirement already satisfied: numpy>=1.14.0 in /usr/local/lib/python3.6/dist-packages (from seqeval) (1.18.3)\n","Requirement already satisfied: Keras>=2.2.4 in /usr/local/lib/python3.6/dist-packages (from seqeval) (2.3.1)\n","Requirement already satisfied: keras-preprocessing>=1.0.5 in /usr/local/lib/python3.6/dist-packages (from Keras>=2.2.4->seqeval) (1.1.0)\n","Requirement already satisfied: keras-applications>=1.0.6 in /usr/local/lib/python3.6/dist-packages (from Keras>=2.2.4->seqeval) (1.0.8)\n","Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from Keras>=2.2.4->seqeval) (1.12.0)\n","Requirement already satisfied: pyyaml in /usr/local/lib/python3.6/dist-packages (from Keras>=2.2.4->seqeval) (3.13)\n","Requirement already satisfied: scipy>=0.14 in /usr/local/lib/python3.6/dist-packages (from Keras>=2.2.4->seqeval) (1.4.1)\n","Requirement already satisfied: h5py in /usr/local/lib/python3.6/dist-packages (from Keras>=2.2.4->seqeval) (2.10.0)\n","Requirement already satisfied: transformers in /usr/local/lib/python3.6/dist-packages (2.8.0)\n","Requirement already satisfied: regex!=2019.12.17 in 
/usr/local/lib/python3.6/dist-packages (from transformers) (2019.12.20)\n","Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from transformers) (1.18.3)\n","Requirement already satisfied: dataclasses; python_version < \"3.7\" in /usr/local/lib/python3.6/dist-packages (from transformers) (0.7)\n","Requirement already satisfied: sentencepiece in /usr/local/lib/python3.6/dist-packages (from transformers) (0.1.86)\n","Requirement already satisfied: sacremoses in /usr/local/lib/python3.6/dist-packages (from transformers) (0.0.43)\n","Requirement already satisfied: filelock in /usr/local/lib/python3.6/dist-packages (from transformers) (3.0.12)\n","Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.6/dist-packages (from transformers) (4.38.0)\n","Requirement already satisfied: boto3 in /usr/local/lib/python3.6/dist-packages (from transformers) (1.12.47)\n","Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from transformers) (2.23.0)\n","Requirement already satisfied: tokenizers==0.5.2 in /usr/local/lib/python3.6/dist-packages (from transformers) (0.5.2)\n","Requirement already satisfied: joblib in /usr/local/lib/python3.6/dist-packages (from sacremoses->transformers) (0.14.1)\n","Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from sacremoses->transformers) (1.12.0)\n","Requirement already satisfied: click in /usr/local/lib/python3.6/dist-packages (from sacremoses->transformers) (7.1.2)\n","Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.6/dist-packages (from boto3->transformers) (0.9.5)\n","Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /usr/local/lib/python3.6/dist-packages (from boto3->transformers) (0.3.3)\n","Requirement already satisfied: botocore<1.16.0,>=1.15.47 in /usr/local/lib/python3.6/dist-packages (from boto3->transformers) (1.15.47)\n","Requirement already satisfied: chardet<4,>=3.0.2 in 
/usr/local/lib/python3.6/dist-packages (from requests->transformers) (3.0.4)\n","Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests->transformers) (1.24.3)\n","Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests->transformers) (2020.4.5.1)\n","Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->transformers) (2.9)\n","Requirement already satisfied: docutils<0.16,>=0.10 in /usr/local/lib/python3.6/dist-packages (from botocore<1.16.0,>=1.15.47->boto3->transformers) (0.15.2)\n","Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /usr/local/lib/python3.6/dist-packages (from botocore<1.16.0,>=1.15.47->boto3->transformers) (2.8.1)\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"id":"RcUnNx1WUi7l","colab_type":"code","colab":{}},"source":["import pandas as pd\n","import math\n","import numpy as np\n","from seqeval.metrics import f1_score\n","from seqeval.metrics import classification_report,accuracy_score,f1_score\n","import torch.nn.functional as F"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"nBU6k3dQUi7p","colab_type":"code","outputId":"53838a8b-205f-4c13-b1fb-de2098263d15","executionInfo":{"status":"ok","timestamp":1588622699179,"user_tz":420,"elapsed":13592,"user":{"displayName":"Soujanya Ranganatha Bhat","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjYZYrqnugKHbgoo144GZ9rzmvTfTGIL9eFkBCz=s64","userId":"15617339232293464832"}},"colab":{"base_uri":"https://localhost:8080/","height":34}},"source":["import torch\n","import os\n","from tqdm import tqdm,trange\n","from torch.optim import Adam\n","from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler\n","from keras.preprocessing.sequence import pad_sequences\n","from sklearn.model_selection import train_test_split\n","from transformers import 
BertTokenizer, BertConfig\n","from transformers import BertForTokenClassification, AdamW"],"execution_count":7,"outputs":[{"output_type":"stream","text":["Using TensorFlow backend.\n"],"name":"stderr"}]},{"cell_type":"code","metadata":{"id":"-sUo6M6fUi7t","colab_type":"code","outputId":"55787417-0f3d-40f0-db5c-c402c11b6bdf","executionInfo":{"status":"ok","timestamp":1588622701190,"user_tz":420,"elapsed":15586,"user":{"displayName":"Soujanya Ranganatha Bhat","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjYZYrqnugKHbgoo144GZ9rzmvTfTGIL9eFkBCz=s64","userId":"15617339232293464832"}},"colab":{"base_uri":"https://localhost:8080/","height":153}},"source":["# Check library versions\n","!pip list | grep -E 'transformers|torch|Keras'"],"execution_count":8,"outputs":[{"output_type":"stream","text":["Keras                    2.3.1          \n","Keras-Applications       1.0.8          \n","Keras-Preprocessing      1.1.0          \n","torch                    1.5.0+cu101    \n","torchsummary             1.5.1          \n","torchtext                0.3.1          \n","torchvision              0.6.0+cu101    \n","transformers             2.8.0          \n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"8HmaFwPpUi7w","colab_type":"text"},"source":["This notebook is known to work with the following environment (the run above has newer torch and transformers versions installed):"]},{"cell_type":"markdown","metadata":{"id":"g1jIVHbJUi7x","colab_type":"text"},"source":["- Keras 2.3.1\n","- torch 1.1.0\n","- transformers 2.2.0"]},{"cell_type":"markdown","metadata":{"id":"BTn36HHiUi7x","colab_type":"text"},"source":["# Introduction"]},{"cell_type":"markdown","metadata":{"id":"3g3rC1GUUi7y","colab_type":"text"},"source":["NER-based QA with BERT, steps:"]},{"cell_type":"markdown","metadata":{"id":"kTKLxKvLUi7z","colab_type":"text"},"source":["- Load and preprocess data\n","- Parse data\n","- Make training data\n","- Train model\n","- Evaluate result\n","- Predict 
result"]},{"cell_type":"markdown","metadata":{"id":"4X_eNNqGUi75","colab_type":"text"},"source":["## Load data"]},{"cell_type":"markdown","metadata":{"id":"04kRhJ2mUi76","colab_type":"text"},"source":["**Load CSV data**"]},{"cell_type":"code","metadata":{"id":"7sZnV5gGUi76","colab_type":"code","colab":{}},"source":["data_path = PARENT_DIR + \"/data\" "],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"ctpiyXm9Ui7-","colab_type":"code","colab":{}},"source":["# TODO: use \"train.csv\" for the original data or \"train_sample.csv\" for the sampled file (1/100th of the data)\n","train_file_address = PARENT_DIR + \"/data/train_v5.csv\""],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"Z1j-H3sKUi8A","colab_type":"code","colab":{}},"source":["# Forward-fill missing values so that a row with an empty cell inherits the value from the row above\n","# NOTE: encoding changed from latin1 to utf-8\n","df_data = pd.read_csv(train_file_address,sep=\",\",encoding=\"utf-8\").fillna(method='ffill')"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"HuoBFgbsUi8D","colab_type":"code","outputId":"242ca3ac-ec60-418e-ef9c-6225a88a5e69","executionInfo":{"status":"ok","timestamp":1588622705615,"user_tz":420,"elapsed":19970,"user":{"displayName":"Soujanya Ranganatha Bhat","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjYZYrqnugKHbgoo144GZ9rzmvTfTGIL9eFkBCz=s64","userId":"15617339232293464832"}},"colab":{"base_uri":"https://localhost:8080/","height":34}},"source":["df_data.columns"],"execution_count":12,"outputs":[{"output_type":"execute_result","data":{"text/plain":["Index(['ID', 'TOKEN', 'TAG'], dtype='object')"]},"metadata":{"tags":[]},"execution_count":12}]},{"cell_type":"code","metadata":{"id":"Gf7dOgcAUi8H","colab_type":"code","outputId":"5c9431e4-7793-452a-fd60-7dd54278e96c","executionInfo":{"status":"ok","timestamp":1588622705617,"user_tz":420,"elapsed":19957,"user":{"displayName":"Soujanya Ranganatha 
Bhat","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjYZYrqnugKHbgoo144GZ9rzmvTfTGIL9eFkBCz=s64","userId":"15617339232293464832"}},"colab":{"base_uri":"https://localhost:8080/","height":669}},"source":["df_data.head(n=20)"],"execution_count":13,"outputs":[{"output_type":"execute_result","data":{"text/html":["<div>\n","<style scoped>\n","    .dataframe tbody tr th:only-of-type {\n","        vertical-align: middle;\n","    }\n","\n","    .dataframe tbody tr th {\n","        vertical-align: top;\n","    }\n","\n","    .dataframe thead th {\n","        text-align: right;\n","    }\n","</style>\n","<table border=\"1\" class=\"dataframe\">\n","  <thead>\n","    <tr style=\"text-align: right;\">\n","      <th></th>\n","      <th>ID</th>\n","      <th>TOKEN</th>\n","      <th>TAG</th>\n","    </tr>\n","  </thead>\n","  <tbody>\n","    <tr>\n","      <th>0</th>\n","      <td>1</td>\n","      <td>Does</td>\n","      <td>Q</td>\n","    </tr>\n","    <tr>\n","      <th>1</th>\n","      <td>1</td>\n","      <td>the</td>\n","      <td>I</td>\n","    </tr>\n","    <tr>\n","      <th>2</th>\n","      <td>1</td>\n","      <td>author</td>\n","      <td>I</td>\n","    </tr>\n","    <tr>\n","      <th>3</th>\n","      <td>1</td>\n","      <td>claim</td>\n","      <td>I</td>\n","    </tr>\n","    <tr>\n","      <th>4</th>\n","      <td>1</td>\n","      <td>the</td>\n","      <td>I</td>\n","    </tr>\n","    <tr>\n","      <th>5</th>\n","      <td>1</td>\n","      <td>animated</td>\n","      <td>I</td>\n","    </tr>\n","    <tr>\n","      <th>6</th>\n","      <td>1</td>\n","      <td>films</td>\n","      <td>I</td>\n","    </tr>\n","    <tr>\n","      <th>7</th>\n","      <td>1</td>\n","      <td>message</td>\n","      <td>I</td>\n","    </tr>\n","    <tr>\n","      <th>8</th>\n","      <td>1</td>\n","      <td>is</td>\n","      <td>I</td>\n","    </tr>\n","    <tr>\n","      <th>9</th>\n","      <td>1</td>\n","      <td>that</td>\n","      <td>I</td>\n","    </tr>\n","    
<tr>\n","      <th>10</th>\n","      <td>1</td>\n","      <td>the</td>\n","      <td>I</td>\n","    </tr>\n","    <tr>\n","      <th>11</th>\n","      <td>1</td>\n","      <td>NRA</td>\n","      <td>I</td>\n","    </tr>\n","    <tr>\n","      <th>12</th>\n","      <td>1</td>\n","      <td>upholds</td>\n","      <td>I</td>\n","    </tr>\n","    <tr>\n","      <th>13</th>\n","      <td>1</td>\n","      <td>racism</td>\n","      <td>I</td>\n","    </tr>\n","    <tr>\n","      <th>14</th>\n","      <td>1</td>\n","      <td>?</td>\n","      <td>I</td>\n","    </tr>\n","    <tr>\n","      <th>15</th>\n","      <td>1</td>\n","      <td>Yes</td>\n","      <td>C</td>\n","    </tr>\n","    <tr>\n","      <th>16</th>\n","      <td>1</td>\n","      <td>.</td>\n","      <td>I</td>\n","    </tr>\n","    <tr>\n","      <th>17</th>\n","      <td>1</td>\n","      <td>Animated</td>\n","      <td>P</td>\n","    </tr>\n","    <tr>\n","      <th>18</th>\n","      <td>1</td>\n","      <td>history</td>\n","      <td>I</td>\n","    </tr>\n","    <tr>\n","      <th>19</th>\n","      <td>1</td>\n","      <td>of</td>\n","      <td>I</td>\n","    </tr>\n","  </tbody>\n","</table>\n","</div>"],"text/plain":["    ID     TOKEN TAG\n","0    1      Does   Q\n","1    1       the   I\n","2    1    author   I\n","3    1     claim   I\n","4    1       the   I\n","5    1  animated   I\n","6    1     films   I\n","7    1   message   I\n","8    1        is   I\n","9    1      that   I\n","10   1       the   I\n","11   1       NRA   I\n","12   1   upholds   I\n","13   1    racism   I\n","14   1         ?   I\n","15   1       Yes   C\n","16   1         .   
I\n","17   1  Animated   P\n","18   1   history   I\n","19   1        of   I"]},"metadata":{"tags":[]},"execution_count":13}]},{"cell_type":"markdown","metadata":{"id":"qkB6hAWKUi8L","colab_type":"text"},"source":["**TAG categories**\n"]},{"cell_type":"code","metadata":{"id":"-jE4_dKAUi8L","colab_type":"code","outputId":"4d36803a-0144-431b-e775-d877d87723e9","executionInfo":{"status":"ok","timestamp":1588622705892,"user_tz":420,"elapsed":20221,"user":{"displayName":"Soujanya Ranganatha Bhat","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjYZYrqnugKHbgoo144GZ9rzmvTfTGIL9eFkBCz=s64","userId":"15617339232293464832"}},"colab":{"base_uri":"https://localhost:8080/","height":34}},"source":["df_data.TAG.unique()"],"execution_count":14,"outputs":[{"output_type":"execute_result","data":{"text/plain":["array(['Q', 'I', 'C', 'P', 'W'], dtype=object)"]},"metadata":{"tags":[]},"execution_count":14}]},{"cell_type":"code","metadata":{"id":"PBwzJOIFUi8S","colab_type":"code","outputId":"f063458c-62b9-4165-fdf0-bc344edcb63c","executionInfo":{"status":"ok","timestamp":1588622706638,"user_tz":420,"elapsed":20956,"user":{"displayName":"Soujanya Ranganatha Bhat","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjYZYrqnugKHbgoo144GZ9rzmvTfTGIL9eFkBCz=s64","userId":"15617339232293464832"}},"colab":{"base_uri":"https://localhost:8080/","height":34}},"source":["# Data summary\n","df_data['ID'].nunique(), df_data.TOKEN.nunique(), df_data.TAG.nunique()"],"execution_count":15,"outputs":[{"output_type":"execute_result","data":{"text/plain":["(27243, 24067, 5)"]},"metadata":{"tags":[]},"execution_count":15}]},{"cell_type":"code","metadata":{"id":"iLeBXoekUi8U","colab_type":"code","outputId":"567a8b56-c2b7-4d6a-b3fe-921d4895cb79","executionInfo":{"status":"ok","timestamp":1588622707374,"user_tz":420,"elapsed":21679,"user":{"displayName":"Soujanya Ranganatha 
Bhat","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjYZYrqnugKHbgoo144GZ9rzmvTfTGIL9eFkBCz=s64","userId":"15617339232293464832"}},"colab":{"base_uri":"https://localhost:8080/","height":119}},"source":["# TAG distribution\n","df_data.TAG.value_counts()"],"execution_count":16,"outputs":[{"output_type":"execute_result","data":{"text/plain":["I    7710413\n","P     373775\n","Q      27243\n","W      15218\n","C      12025\n","Name: TAG, dtype: int64"]},"metadata":{"tags":[]},"execution_count":16}]},{"cell_type":"markdown","metadata":{"id":"vKbYp0YaUi8X","colab_type":"text"},"source":["### TAG explanation\n","As shown above, there are 4 distinct begin tags, one each for paragraph, question, correct answer and wrong answer, plus an inside tag:\n","- P: first word of a paragraph sentence\n","- Q: first word of the question sentence\n","- C: first word of a correct answer sentence\n","- W: first word of a wrong answer sentence\n","- I: inside, any word that is not the first word of its sentence"]},{"cell_type":"markdown","metadata":{"id":"SmtZkBnFUi8e","colab_type":"text"},"source":["## Parse data"]},{"cell_type":"markdown","metadata":{"id":"SRHouNE4Ui8e","colab_type":"text"},"source":["**Parse data into document structure**"]},{"cell_type":"code","metadata":{"id":"HE87UalmUi8f","colab_type":"code","colab":{}},"source":["class SentenceGetter(object):\n","    \"\"\"Group token rows by ID into lists of (token, tag) pairs.\"\"\"\n","    \n","    def __init__(self, data):\n","        self.n_sent = 1\n","        self.data = data\n","        self.empty = False\n","        agg_func = lambda s: [(w, t) for w, t in zip(s[\"TOKEN\"].values.tolist(),\n","                                                           s[\"TAG\"].values.tolist())]\n","        self.grouped = self.data.groupby(\"ID\").apply(agg_func)\n","        self.sentences = [s for s in self.grouped]\n","    \n","    def get_next(self):\n","        try:\n","            # Groups are indexed by ID, not by \"Sentence: N\" labels, so look them up by position\n","            s = self.grouped.iloc[self.n_sent - 1]\n","         
   self.n_sent += 1\n","            return s\n","        except:\n","            return None"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"Eok4eiP9Ui8i","colab_type":"code","colab":{}},"source":["# Get full document data structure\n","getter = SentenceGetter(df_data)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"gLrGqAXxUi8k","colab_type":"code","outputId":"988d0d3d-4be7-4882-af19-26a33abc3f0a","executionInfo":{"status":"ok","timestamp":1588622714332,"user_tz":420,"elapsed":28605,"user":{"displayName":"Soujanya Ranganatha Bhat","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjYZYrqnugKHbgoo144GZ9rzmvTfTGIL9eFkBCz=s64","userId":"15617339232293464832"}},"colab":{"base_uri":"https://localhost:8080/","height":1000}},"source":["# Get sentence data\n","sentences = [[s[0] for s in sent] for sent in getter.sentences]\n","sentences[0]"],"execution_count":19,"outputs":[{"output_type":"execute_result","data":{"text/plain":["['Does',\n"," 'the',\n"," 'author',\n"," 'claim',\n"," 'the',\n"," 'animated',\n"," 'films',\n"," 'message',\n"," 'is',\n"," 'that',\n"," 'the',\n"," 'NRA',\n"," 'upholds',\n"," 'racism',\n"," '?',\n"," 'Yes',\n"," '.',\n"," 'Animated',\n"," 'history',\n"," 'of',\n"," 'the',\n"," 'US',\n"," '.',\n"," 'Of',\n"," 'course',\n"," 'the',\n"," 'cartoon',\n"," 'is',\n"," 'highly',\n"," 'oversimplified',\n"," 'and',\n"," 'most',\n"," 'critics',\n"," 'consider',\n"," 'it',\n"," 'one',\n"," 'of',\n"," 'the',\n"," 'weakest',\n"," 'parts',\n"," 'of',\n"," 'the',\n"," 'film',\n"," '.',\n"," 'But',\n"," 'it',\n"," 'makes',\n"," 'a',\n"," 'valid',\n"," 'claim',\n"," 'which',\n"," 'you',\n"," 'ignore',\n"," 'entirely:',\n"," 'That',\n"," 'the',\n"," 'strategy',\n"," 'to',\n"," 'promote',\n"," 'gun',\n"," 'rights',\n"," 'for',\n"," 'white',\n"," 'people',\n"," 'and',\n"," 'to',\n"," 'outlaw',\n"," 'gun',\n"," 'possession',\n"," 'by',\n"," 'black',\n"," 'people',\n"," 'was',\n"," 'a',\n"," 'way',\n"," 'to',\n"," 
'uphold',\n"," 'racism',\n"," 'without',\n"," 'letting',\n"," 'an',\n"," 'openly',\n"," 'terrorist',\n"," 'organization',\n"," 'like',\n"," 'the',\n"," 'KKK',\n"," 'flourish',\n"," '.',\n"," 'Did',\n"," 'the',\n"," '19th',\n"," 'century',\n"," 'NRA',\n"," 'in',\n"," 'the',\n"," 'southern',\n"," 'states',\n"," 'promote',\n"," 'gun',\n"," 'rights',\n"," 'for',\n"," 'black',\n"," 'people',\n"," '?',\n"," 'I',\n"," 'highly',\n"," 'doubt',\n"," 'it',\n"," '.',\n"," 'But',\n"," 'if',\n"," 'they',\n"," \"didn't\",\n"," 'one',\n"," 'of',\n"," 'their',\n"," 'functions',\n"," 'was',\n"," 'to',\n"," 'continue',\n"," 'the',\n"," 'racism',\n"," 'of',\n"," 'the',\n"," 'KKK',\n"," '.',\n"," 'This',\n"," 'is',\n"," 'the',\n"," 'key',\n"," 'message',\n"," 'of',\n"," 'this',\n"," 'part',\n"," 'of',\n"," 'the',\n"," 'animation',\n"," 'which',\n"," 'is',\n"," 'again',\n"," 'being',\n"," 'ignored',\n"," 'by',\n"," 'its',\n"," 'critics',\n"," '.',\n"," 'Buell',\n"," 'shooting',\n"," 'in',\n"," 'Flint',\n"," '.',\n"," 'You',\n"," 'write:',\n"," 'Fact:',\n"," 'The',\n"," 'little',\n"," 'boy',\n"," 'was',\n"," 'the',\n"," 'class',\n"," 'thug',\n"," 'already',\n"," 'suspended',\n"," 'from',\n"," 'school',\n"," 'for',\n"," 'stabbing',\n"," 'another',\n"," 'kid',\n"," 'with',\n"," 'a',\n"," 'pencil',\n"," 'and',\n"," 'had',\n"," 'fought',\n"," 'with',\n"," 'Kayla',\n"," 'the',\n"," 'day',\n"," 'before',\n"," '.',\n"," 'This',\n"," 'characterization',\n"," 'of',\n"," 'a',\n"," 'six-year-old',\n"," 'as',\n"," 'a',\n"," 'pencil-stabbing',\n"," 'thug',\n"," 'is',\n"," 'exactly',\n"," 'the',\n"," 'kind',\n"," 'of',\n"," 'hysteria',\n"," 'that',\n"," \"Moore's\",\n"," 'film',\n"," 'warns',\n"," 'against',\n"," '.',\n"," 'It',\n"," 'is',\n"," 'the',\n"," 'typical',\n"," 'right-wing',\n"," 'reaction',\n"," 'which',\n"," 'looks',\n"," 'for',\n"," 'simple',\n"," 'answers',\n"," 'that',\n"," 'do',\n"," 'not',\n"," 'contradict',\n"," 'the',\n"," 'Republican',\n"," 'mindset',\n"," '.',\n"," 'The',\n"," 
'kid',\n"," 'was',\n"," 'a',\n"," 'little',\n"," 'bastard',\n"," 'and',\n"," 'the',\n"," 'parents',\n"," 'were',\n"," 'involved',\n"," 'in',\n"," 'drugs',\n"," '--',\n"," 'case',\n"," 'closed',\n"," '.',\n"," 'But',\n"," 'why',\n"," 'do',\n"," 'people',\n"," 'deal',\n"," 'with',\n"," 'drugs',\n"," '?',\n"," 'Because',\n"," \"it's\",\n"," 'so',\n"," 'much',\n"," 'fun',\n"," 'to',\n"," 'do',\n"," 'so',\n"," '?',\n"," 'It',\n"," 'is',\n"," 'by',\n"," 'now',\n"," 'well',\n"," 'documented',\n"," 'that',\n"," 'the',\n"," 'CIA',\n"," 'tolerated',\n"," 'crack',\n"," 'sales',\n"," 'in',\n"," 'US',\n"," 'cities',\n"," 'to',\n"," 'fund',\n"," 'the',\n"," 'operation',\n"," 'of',\n"," 'South',\n"," 'American',\n"," 'contras',\n"," 'It',\n"," 'is',\n"," 'equally',\n"," 'well',\n"," 'known',\n"," 'that',\n"," 'the',\n"," 'so-called',\n"," 'war',\n"," 'on',\n"," 'drugs',\n"," 'begun',\n"," 'under',\n"," 'the',\n"," 'Nixon',\n"," 'administration',\n"," 'is',\n"," 'a',\n"," 'failure',\n"," 'which',\n"," 'has',\n"," 'cost',\n"," 'hundreds',\n"," 'of',\n"," 'billions',\n"," 'and',\n"," 'made',\n"," 'America',\n"," 'the',\n"," 'world',\n"," 'leader',\n"," 'in',\n"," 'prison',\n"," 'population',\n"," '(both',\n"," 'in',\n"," 'relative',\n"," 'and',\n"," 'absolute',\n"," 'numbers)',\n"," '.']"]},"metadata":{"tags":[]},"execution_count":19}]},{"cell_type":"code","metadata":{"id":"dQrZXSEcUi8p","colab_type":"code","outputId":"eb0c6312-a2ec-4d15-86e5-3ca8d616122b","executionInfo":{"status":"ok","timestamp":1588622715269,"user_tz":420,"elapsed":29530,"user":{"displayName":"Soujanya Ranganatha Bhat","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjYZYrqnugKHbgoo144GZ9rzmvTfTGIL9eFkBCz=s64","userId":"15617339232293464832"}},"colab":{"base_uri":"https://localhost:8080/","height":54}},"source":["# Get TAG labels data\n","labels = [[s[1] for s in sent] for sent in getter.sentences]\n","print(labels[0])"],"execution_count":20,"outputs":[{"output_type":"stream","text":["['Q', 'I', 'I', 'I', 
'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'C', 'I', 'P', 'I', 'I', 'I', 'I', 'I', 'P', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'P', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'P', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'P', 'I', 'I', 'I', 'I', 'P', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'P', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'P', 'I', 'I', 'I', 'I', 'P', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'P', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'P', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'P', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'P', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'P', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'P', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I']\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"kMCiMsc7Ui8u","colab_type":"text"},"source":["**Convert TAG name into index for training**"]},{"cell_type":"code","metadata":{"id":"qpRLGoGuUi8v","colab_type":"code","colab":{}},"source":["tags_vals = 
list(set(df_data[\"TAG\"].values))"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"ExdzObDaUi8y","colab_type":"code","colab":{}},"source":["# Add an 'X' label for word-piece continuation tokens\n","# Add [CLS] and [SEP] labels, as BERT requires\n","tags_vals.append('X')\n","tags_vals.append('[CLS]')\n","tags_vals.append('[SEP]')"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"_AG-a8kcUi81","colab_type":"code","colab":{}},"source":["tags_vals = set(tags_vals)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"JxBhynbKUi83","colab_type":"code","outputId":"aa0fe6c9-8bdd-4a9e-8f82-983a82c8c484","executionInfo":{"status":"ok","timestamp":1588622715579,"user_tz":420,"elapsed":29805,"user":{"displayName":"Soujanya Ranganatha Bhat","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjYZYrqnugKHbgoo144GZ9rzmvTfTGIL9eFkBCz=s64","userId":"15617339232293464832"}},"colab":{"base_uri":"https://localhost:8080/","height":34}},"source":["tags_vals"],"execution_count":24,"outputs":[{"output_type":"execute_result","data":{"text/plain":["{'C', 'I', 'P', 'Q', 'W', 'X', '[CLS]', '[SEP]'}"]},"metadata":{"tags":[]},"execution_count":24}]},{"cell_type":"code","metadata":{"id":"5kC5wynXUi87","colab_type":"code","colab":{}},"source":["# Dict mapping tag name to index\n","#tag2idx = {t: i for i, t in enumerate(tags_vals)}\n","\n","# Manual definition (keeps the indices fixed across runs)\n","tag2idx={'C': 2,\n"," 'I': 3,\n"," 'P': 0,\n"," 'Q': 1,\n"," 'W': 4,\n"," 'X':5,\n"," '[CLS]':6,\n"," '[SEP]':7}"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"q0SyPToeUi8_","colab_type":"code","outputId":"7a428000-7ab6-4cf7-86a7-8d0bf5107987","executionInfo":{"status":"ok","timestamp":1588622715583,"user_tz":420,"elapsed":29789,"user":{"displayName":"Soujanya Ranganatha 
Bhat","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjYZYrqnugKHbgoo144GZ9rzmvTfTGIL9eFkBCz=s64","userId":"15617339232293464832"}},"colab":{"base_uri":"https://localhost:8080/","height":34}},"source":["tag2idx"],"execution_count":26,"outputs":[{"output_type":"execute_result","data":{"text/plain":["{'C': 2, 'I': 3, 'P': 0, 'Q': 1, 'W': 4, 'X': 5, '[CLS]': 6, '[SEP]': 7}"]},"metadata":{"tags":[]},"execution_count":26}]},{"cell_type":"code","metadata":{"id":"x_DHev6tUi9C","colab_type":"code","colab":{}},"source":["# Reverse mapping: index to tag name\n","tag2name={tag2idx[key] : key for key in tag2idx.keys()}"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"FEkTJIkBUi9J","colab_type":"code","outputId":"23429035-29b7-473c-9dd9-67fa436120d9","executionInfo":{"status":"ok","timestamp":1588622715587,"user_tz":420,"elapsed":29772,"user":{"displayName":"Soujanya Ranganatha Bhat","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjYZYrqnugKHbgoo144GZ9rzmvTfTGIL9eFkBCz=s64","userId":"15617339232293464832"}},"colab":{"base_uri":"https://localhost:8080/","height":34}},"source":["tag2name"],"execution_count":28,"outputs":[{"output_type":"execute_result","data":{"text/plain":["{0: 'P', 1: 'Q', 2: 'C', 3: 'I', 4: 'W', 5: 'X', 6: '[CLS]', 7: '[SEP]'}"]},"metadata":{"tags":[]},"execution_count":28}]},{"cell_type":"markdown","metadata":{"id":"-MaQoRbYUi9N","colab_type":"text"},"source":["## Preparation - Training Data"]},{"cell_type":"markdown","metadata":{"id":"ORMHJaGvUi9N","colab_type":"text"},"source":["Convert the raw data into trainable input for BERT. This includes:"]},{"cell_type":"markdown","metadata":{"id":"tfRMtXayUi9O","colab_type":"text"},"source":["- Set up the GPU environment\n","- Load the tokenizer and tokenize the text\n","- Set up the 3 embedding inputs: tokens, word masks, segments\n","- Use the TRAIN and VALIDATION sets"]},{"cell_type":"markdown","metadata":{"id":"0oUKNJBMUi9R","colab_type":"text"},"source":["**Setting-up GPU 
environment**"]},{"cell_type":"code","metadata":{"id":"9vtq3HojUi9S","colab_type":"code","colab":{}},"source":["device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n","n_gpu = torch.cuda.device_count()"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"kuu0D9gaUi9U","colab_type":"code","outputId":"415b7bb0-476f-497d-abe7-a6d2d8b27ea3","executionInfo":{"status":"ok","timestamp":1588622715589,"user_tz":420,"elapsed":29755,"user":{"displayName":"Soujanya Ranganatha Bhat","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjYZYrqnugKHbgoo144GZ9rzmvTfTGIL9eFkBCz=s64","userId":"15617339232293464832"}},"colab":{"base_uri":"https://localhost:8080/","height":34}},"source":["n_gpu"],"execution_count":30,"outputs":[{"output_type":"execute_result","data":{"text/plain":["1"]},"metadata":{"tags":[]},"execution_count":30}]},{"cell_type":"markdown","metadata":{"id":"Il3J1IZ5Ui9X","colab_type":"text"},"source":["### Loading Tokenizer"]},{"cell_type":"markdown","metadata":{"id":"NE6BEvm7Ui9Y","colab_type":"text"},"source":["Downloading the tokenizer file into GDrive folder first :\n","- [vocab.txt](https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt)"]},{"cell_type":"code","metadata":{"id":"wvvpxN3RUi9Y","colab_type":"code","colab":{}},"source":["vocabulary = PARENT_DIR + \"/models/vocab.txt\""],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"RKRuXFMiUi9a","colab_type":"code","colab":{}},"source":["# Length of the sentence = 384 (dataset analysis- paragraph + question + answers = ~ 350, generally.)\n","# CAUTION - should be less than 512\n","# TODO : try with increased length\n","max_len  = 384"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"ThgQqj6wUi9d","colab_type":"code","colab":{}},"source":["# load tokenizer, with manual file address or pretrained 
address\n","tokenizer=BertTokenizer(vocab_file=vocabulary,do_lower_case=False)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"kmXz0B4oUi9k","colab_type":"text"},"source":["**Tokenizer text**"]},{"cell_type":"markdown","metadata":{"id":"5ESveNcEUi9l","colab_type":"text"},"source":["- In hunggieface for bert, when come across OOV, will word piece the word\n","- We need to adjust the labels base on the tokenize result, “##abc” need to set label \"X\" \n","- Need to set \"[CLS]\" at front and \"[SEP]\" at the end, as what the paper do, [BERT indexer should add [CLS] and [SEP] tokens](https://github.com/allenai/allennlp/issues/2141)\n"]},{"cell_type":"code","metadata":{"id":"LamYC4kWUi9m","colab_type":"code","outputId":"6697571c-e8c4-4c70-9bf3-3546fd9b4838","executionInfo":{"status":"ok","timestamp":1588622973603,"user_tz":420,"elapsed":287731,"user":{"displayName":"Soujanya Ranganatha Bhat","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjYZYrqnugKHbgoo144GZ9rzmvTfTGIL9eFkBCz=s64","userId":"15617339232293464832"}},"colab":{"base_uri":"https://localhost:8080/","height":377}},"source":["tokenized_texts = []\n","word_piece_labels = []\n","i_inc = 0\n","for word_list,label in (zip(sentences,labels)):\n","    temp_lable = []\n","    temp_token = []\n","    \n","    # Add [CLS] at the front \n","    temp_lable.append('[CLS]')\n","    temp_token.append('[CLS]')\n","    \n","    for word,lab in zip(word_list,label):\n","        token_list = tokenizer.tokenize(word)\n","        for m,token in enumerate(token_list):\n","            temp_token.append(token)\n","            if m==0:\n","                temp_lable.append(lab)\n","            else:\n","                temp_lable.append('X')  \n","                \n","    # Add [SEP] at the end\n","    temp_lable.append('[SEP]')\n","    temp_token.append('[SEP]')\n","    \n","    tokenized_texts.append(temp_token)\n","    word_piece_labels.append(temp_lable)\n","    \n","    if 5 > i_inc:\n","    
    print(\"No.%d,len:%d\"%(i_inc,len(temp_token)))\n","        print(\"texts:%s\"%(\" \".join(temp_token)))\n","        print(\"No.%d,len:%d\"%(i_inc,len(temp_lable)))\n","        print(\"lables:%s\"%(\" \".join(temp_lable)))\n","    i_inc +=1\n","    \n","    \n","    "],"execution_count":34,"outputs":[{"output_type":"stream","text":["No.0,len:371\n","texts:[CLS] Does the author claim the animated films message is that the N ##RA up ##hold ##s racism ? Yes . Animated history of the US . Of course the cartoon is highly overs ##im ##plified and most critics consider it one of the weak ##est parts of the film . But it makes a valid claim which you ignore entirely : That the strategy to promote gun rights for white people and to out ##law gun possession by black people was a way to up ##hold racism without letting an openly terrorist organization like the K ##K ##K flourish . Did the 19th century N ##RA in the southern states promote gun rights for black people ? I highly doubt it . But if they didn ' t one of their functions was to continue the racism of the K ##K ##K . This is the key message of this part of the animation which is again being ignored by its critics . B ##uel ##l shooting in Flint . You write : F ##act : The little boy was the class th ##ug already suspended from school for stabbing another kid with a pencil and had fought with Kay ##la the day before . This characterization of a six - year - old as a pencil - stabbing th ##ug is exactly the kind of h ##yster ##ia that Moore ' s film warns against . It is the typical right - wing reaction which looks for simple answers that do not con ##tra ##dict the Republican minds ##et . The kid was a little bastard and the parents were involved in drugs - - case closed . But why do people deal with drugs ? Because it ' s so much fun to do so ? 
It is by now well documented that the CIA tolerate ##d crack sales in US cities to fund the operation of South American con ##tras It is equally well known that the so - called war on drugs begun under the Nixon administration is a failure which has cost hundreds of billion ##s and made America the world leader in prison population ( both in relative and absolute numbers ) . [SEP]\n","No.0,len:371\n","lables:[CLS] Q I I I I I I I I I I I X I X X I I C I P I I I I I P I I I I I I X X I I I I I I I I I X I I I I I P I I I I I I I I I X I I I I I I I I I I I I I X I I I I I I I I I I X I I I I I I I I I I X X I I P I I I I X I I I I I I I I I I I P I I I I P I I I X X I I I I I I I I I I I I X X I P I I I I I I I I I I I I I I I I I I I P X X I I I I P I X I X X I I I I I I I X I I I I I I I I I I I I I I I I X I I I I P I I I I X X X X I I I X X I X I I I I I I X X I I X X I I I I P I I I I X X I I I I I I I I I I X X I I I X I P I I I I I I I I I I I I I X I I I P I I I I I I I P I X X I I I I I I I P I I I I I I I I I X I I I I I I I I I I I I I X I I I I I I I I X X I I I I I I I I I I I I I I I I I X I I I I I I I I I I X I I I I I X I [SEP]\n","No.1,len:374\n","texts:[CLS] Does the author claim the animated films message is that the N ##RA up ##hold ##s racism ? Up ##hold and continue . Animated history of the US . Of course the cartoon is highly overs ##im ##plified and most critics consider it one of the weak ##est parts of the film . But it makes a valid claim which you ignore entirely : That the strategy to promote gun rights for white people and to out ##law gun possession by black people was a way to up ##hold racism without letting an openly terrorist organization like the K ##K ##K flourish . Did the 19th century N ##RA in the southern states promote gun rights for black people ? I highly doubt it . But if they didn ' t one of their functions was to continue the racism of the K ##K ##K . 
This is the key message of this part of the animation which is again being ignored by its critics . B ##uel ##l shooting in Flint . You write : F ##act : The little boy was the class th ##ug already suspended from school for stabbing another kid with a pencil and had fought with Kay ##la the day before . This characterization of a six - year - old as a pencil - stabbing th ##ug is exactly the kind of h ##yster ##ia that Moore ' s film warns against . It is the typical right - wing reaction which looks for simple answers that do not con ##tra ##dict the Republican minds ##et . The kid was a little bastard and the parents were involved in drugs - - case closed . But why do people deal with drugs ? Because it ' s so much fun to do so ? It is by now well documented that the CIA tolerate ##d crack sales in US cities to fund the operation of South American con ##tras It is equally well known that the so - called war on drugs begun under the Nixon administration is a failure which has cost hundreds of billion ##s and made America the world leader in prison population ( both in relative and absolute numbers ) . 
[SEP]\n","No.1,len:374\n","lables:[CLS] Q I I I I I I I I I I I X I X X I I C X I I I P I I I I I P I I I I I I X X I I I I I I I I I X I I I I I P I I I I I I I I I X I I I I I I I I I I I I I X I I I I I I I I I I X I I I I I I I I I I X X I I P I I I I X I I I I I I I I I I I P I I I I P I I I X X I I I I I I I I I I I I X X I P I I I I I I I I I I I I I I I I I I I P X X I I I I P I X I X X I I I I I I I X I I I I I I I I I I I I I I I I X I I I I P I I I I X X X X I I I X X I X I I I I I I X X I I X X I I I I P I I I I X X I I I I I I I I I I X X I I I X I P I I I I I I I I I I I I I X I I I P I I I I I I I P I X X I I I I I I I P I I I I I I I I I X I I I I I I I I I I I I I X I I I I I I I I X X I I I I I I I I I I I I I I I I I X I I I I I I I I I I X I I I I I X I [SEP]\n","No.2,len:371\n","texts:[CLS] Does the author claim the animated films message is that the N ##RA up ##hold ##s racism ? No . Animated history of the US . Of course the cartoon is highly overs ##im ##plified and most critics consider it one of the weak ##est parts of the film . But it makes a valid claim which you ignore entirely : That the strategy to promote gun rights for white people and to out ##law gun possession by black people was a way to up ##hold racism without letting an openly terrorist organization like the K ##K ##K flourish . Did the 19th century N ##RA in the southern states promote gun rights for black people ? I highly doubt it . But if they didn ' t one of their functions was to continue the racism of the K ##K ##K . This is the key message of this part of the animation which is again being ignored by its critics . B ##uel ##l shooting in Flint . You write : F ##act : The little boy was the class th ##ug already suspended from school for stabbing another kid with a pencil and had fought with Kay ##la the day before . This characterization of a six - year - old as a pencil - stabbing th ##ug is exactly the kind of h ##yster ##ia that Moore ' s film warns against . 
It is the typical right - wing reaction which looks for simple answers that do not con ##tra ##dict the Republican minds ##et . The kid was a little bastard and the parents were involved in drugs - - case closed . But why do people deal with drugs ? Because it ' s so much fun to do so ? It is by now well documented that the CIA tolerate ##d crack sales in US cities to fund the operation of South American con ##tras It is equally well known that the so - called war on drugs begun under the Nixon administration is a failure which has cost hundreds of billion ##s and made America the world leader in prison population ( both in relative and absolute numbers ) . [SEP]\n","No.2,len:371\n","lables:[CLS] Q I I I I I I I I I I I X I X X I I W I P I I I I I P I I I I I I X X I I I I I I I I I X I I I I I P I I I I I I I I I X I I I I I I I I I I I I I X I I I I I I I I I I X I I I I I I I I I I X X I I P I I I I X I I I I I I I I I I I P I I I I P I I I X X I I I I I I I I I I I I X X I P I I I I I I I I I I I I I I I I I I I P X X I I I I P I X I X X I I I I I I I X I I I I I I I I I I I I I I I I X I I I I P I I I I X X X X I I I X X I X I I I I I I X X I I X X I I I I P I I I I X X I I I I I I I I I I X X I I I X I P I I I I I I I I I I I I I X I I I P I I I I I I I P I X X I I I I I I I P I I I I I I I I I X I I I I I I I I I I I I I X I I I I I I I I X X I I I I I I I I I I I I I I I I I X I I I I I I I I I I X I I I I I X I [SEP]\n","No.3,len:401\n","texts:[CLS] Which key message ( s ) do ( es ) this passage say the critics ignored ? The strategy to promote gun rights for white people while out ##law ##ing it for black people allowed r ##ac ##isi ##m to continue without allowing to K ##K ##K to flourish . Animated history of the US . Of course the cartoon is highly overs ##im ##plified and most critics consider it one of the weak ##est parts of the film . 
But it makes a valid claim which you ignore entirely : That the strategy to promote gun rights for white people and to out ##law gun possession by black people was a way to up ##hold racism without letting an openly terrorist organization like the K ##K ##K flourish . Did the 19th century N ##RA in the southern states promote gun rights for black people ? I highly doubt it . But if they didn ' t one of their functions was to continue the racism of the K ##K ##K . This is the key message of this part of the animation which is again being ignored by its critics . B ##uel ##l shooting in Flint . You write : F ##act : The little boy was the class th ##ug already suspended from school for stabbing another kid with a pencil and had fought with Kay ##la the day before . This characterization of a six - year - old as a pencil - stabbing th ##ug is exactly the kind of h ##yster ##ia that Moore ' s film warns against . It is the typical right - wing reaction which looks for simple answers that do not con ##tra ##dict the Republican minds ##et . The kid was a little bastard and the parents were involved in drugs - - case closed . But why do people deal with drugs ? Because it ' s so much fun to do so ? It is by now well documented that the CIA tolerate ##d crack sales in US cities to fund the operation of South American con ##tras It is equally well known that the so - called war on drugs begun under the Nixon administration is a failure which has cost hundreds of billion ##s and made America the world leader in prison population ( both in relative and absolute numbers ) . 
[SEP]\n","No.3,len:401\n","lables:[CLS] Q I I X X X I X X X I I I I I I I C I I I I I I I I I I X X I I I I I I X X X I I I I I I X X I I I P I I I I I P I I I I I I X X I I I I I I I I I X I I I I I P I I I I I I I I I X I I I I I I I I I I I I I X I I I I I I I I I I X I I I I I I I I I I X X I I P I I I I X I I I I I I I I I I I P I I I I P I I I X X I I I I I I I I I I I I X X I P I I I I I I I I I I I I I I I I I I I P X X I I I I P I X I X X I I I I I I I X I I I I I I I I I I I I I I I I X I I I I P I I I I X X X X I I I X X I X I I I I I I X X I I X X I I I I P I I I I X X I I I I I I I I I I X X I I I X I P I I I I I I I I I I I I I X I I I P I I I I I I I P I X X I I I I I I I P I I I I I I I I I X I I I I I I I I I I I I I X I I I I I I I I X X I I I I I I I I I I I I I I I I I X I I I I I I I I I I X I I I I I X I [SEP]\n","No.4,len:378\n","texts:[CLS] Which key message ( s ) do ( es ) this passage say the critics ignored ? That it ant ##agon ##ized the K ##K ##K . Animated history of the US . Of course the cartoon is highly overs ##im ##plified and most critics consider it one of the weak ##est parts of the film . But it makes a valid claim which you ignore entirely : That the strategy to promote gun rights for white people and to out ##law gun possession by black people was a way to up ##hold racism without letting an openly terrorist organization like the K ##K ##K flourish . Did the 19th century N ##RA in the southern states promote gun rights for black people ? I highly doubt it . But if they didn ' t one of their functions was to continue the racism of the K ##K ##K . This is the key message of this part of the animation which is again being ignored by its critics . B ##uel ##l shooting in Flint . You write : F ##act : The little boy was the class th ##ug already suspended from school for stabbing another kid with a pencil and had fought with Kay ##la the day before . 
This characterization of a six - year - old as a pencil - stabbing th ##ug is exactly the kind of h ##yster ##ia that Moore ' s film warns against . It is the typical right - wing reaction which looks for simple answers that do not con ##tra ##dict the Republican minds ##et . The kid was a little bastard and the parents were involved in drugs - - case closed . But why do people deal with drugs ? Because it ' s so much fun to do so ? It is by now well documented that the CIA tolerate ##d crack sales in US cities to fund the operation of South American con ##tras It is equally well known that the so - called war on drugs begun under the Nixon administration is a failure which has cost hundreds of billion ##s and made America the world leader in prison population ( both in relative and absolute numbers ) . [SEP]\n","No.4,len:378\n","lables:[CLS] Q I I X X X I X X X I I I I I I I W I I X X I I X X I P I I I I I P I I I I I I X X I I I I I I I I I X I I I I I P I I I I I I I I I X I I I I I I I I I I I I I X I I I I I I I I I I X I I I I I I I I I I X X I I P I I I I X I I I I I I I I I I I P I I I I P I I I X X I I I I I I I I I I I I X X I P I I I I I I I I I I I I I I I I I I I P X X I I I I P I X I X X I I I I I I I X I I I I I I I I I I I I I I I I X I I I I P I I I I X X X X I I I X X I X I I I I I I X X I I X X I I I I P I I I I X X I I I I I I I I I I X X I I I X I P I I I I I I I I I I I I I X I I I P I I I I I I I P I X X I I I I I I I P I I I I I I I I I X I I I I I I I I I I I I I X I I I I I I I I X X I I I I I I I I I I I I I I I I I X I I I I I I I I I I X I I I I I X I [SEP]\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"36NKtzIoUi9y","colab_type":"text"},"source":["### Setting-up token embedding"]},{"cell_type":"markdown","metadata":{"id":"Zf0AGkN1Ui9y","colab_type":"text"},"source":["Pad or trim the text and label to fit the need for max 
len"]},{"cell_type":"code","metadata":{"id":"LgI58czOUi9z","colab_type":"code","outputId":"e77fa8a3-bccc-4ae4-8152-e3e7cfeefdc2","executionInfo":{"status":"ok","timestamp":1588622982410,"user_tz":420,"elapsed":296523,"user":{"displayName":"Soujanya Ranganatha Bhat","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjYZYrqnugKHbgoo144GZ9rzmvTfTGIL9eFkBCz=s64","userId":"15617339232293464832"}},"colab":{"base_uri":"https://localhost:8080/","height":561}},"source":["# Make text token into id\n","input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts],\n","                          maxlen=max_len, dtype=\"long\", truncating=\"post\", padding=\"post\")\n","print(input_ids[0])"],"execution_count":35,"outputs":[{"output_type":"stream","text":["[  101  7187  1103  2351  3548  1103  6608  2441  3802  1110  1115  1103\n","   151  9664  1146  8678  1116 16654   136  2160   119 24238  1607  1104\n","  1103  1646   119  2096  1736  1103 11540  1110  3023 17074  4060 18580\n","  1105  1211  4217  4615  1122  1141  1104  1103  4780  2556  2192  1104\n","  1103  1273   119  1252  1122  2228   170  9221  3548  1134  1128  8429\n","  3665   131  1337  1103  5564  1106  4609  2560  2266  1111  1653  1234\n","  1105  1106  1149  9598  2560  6224  1118  1602  1234  1108   170  1236\n","  1106  1146  8678 16654  1443  5074  1126  9990  9640  2369  1176  1103\n","   148  2428  2428 27760   119  2966  1103  2835  1432   151  9664  1107\n","  1103  2359  2231  4609  2560  2266  1111  1602  1234   136   146  3023\n","  4095  1122   119  1252  1191  1152  1238   112   189  1141  1104  1147\n","  4226  1108  1106  2760  1103 16654  1104  1103   148  2428  2428   119\n","  1188  1110  1103  2501  3802  1104  1142  1226  1104  1103  8794  1134\n","  1110  1254  1217  5794  1118  1157  4217   119   139 24741  1233  4598\n","  1107 17741   119  1192  3593   131   143 11179   131  1109  1376  2298\n","  1108  1103  1705 24438  9610  1640  6232  1121  1278  
1111 24728  1330\n","  5102  1114   170 16372  1105  1125  3214  1114 11247  1742  1103  1285\n","  1196   119  1188 27419  1104   170  1565   118  1214   118  1385  1112\n","   170 16372   118 24728 24438  9610  1110  2839  1103  1912  1104   177\n"," 21878  1465  1115  4673   112   188  1273 21310  1222   119  1135  1110\n","  1103  4701  1268   118  3092  3943  1134  2736  1111  3014  6615  1115\n","  1202  1136 14255  4487 28113  1103  3215 10089  2105   119  1109  5102\n","  1108   170  1376  8735  1105  1103  2153  1127  2017  1107  5557   118\n","   118  1692  1804   119  1252  1725  1202  1234  2239  1114  5557   136\n","  2279  1122   112   188  1177  1277  4106  1106  1202  1177   136  1135\n","  1110  1118  1208  1218  8510  1115  1103  9878 21073  1181  8672  3813\n","  1107  1646  3038  1106  5841  1103  2805  1104  1375  1237 14255 25352\n","  1135  1110  7808  1218  1227  1115  1103  1177   118  1270  1594  1113\n","  5557  4972  1223  1103 11302  3469  1110   170  4290  1134  1144  2616\n","  5229  1104  3775  1116  1105  1189  1738  1103  1362  2301  1107  3315\n","  1416   113  1241  1107  5236  1105  7846  2849   114   119   102     0\n","     0     0     0     0     0     0     0     0     0     0     0     0]\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"id":"1hT4lASzUi91","colab_type":"code","outputId":"1f07d7e9-e838-457f-b2ac-83774dc0db2a","executionInfo":{"status":"ok","timestamp":1588622985146,"user_tz":420,"elapsed":299243,"user":{"displayName":"Soujanya Ranganatha Bhat","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjYZYrqnugKHbgoo144GZ9rzmvTfTGIL9eFkBCz=s64","userId":"15617339232293464832"}},"colab":{"base_uri":"https://localhost:8080/","height":204}},"source":["# Make label into id, pad with \"W\" meaning others/wrong\n","# Note - Replaced \"O\" -> \"W\" (wrong)\n","tags = pad_sequences([[tag2idx.get(l) for l in lab] for lab in word_piece_labels],\n","                     maxlen=max_len, value=tag2idx[\"W\"], 
padding=\"post\",\n","                     dtype=\"long\", truncating=\"post\")\n","print(tags[0])"],"execution_count":36,"outputs":[{"output_type":"stream","text":["[6 1 3 3 3 3 3 3 3 3 3 3 3 5 3 5 5 3 3 2 3 0 3 3 3 3 3 0 3 3 3 3 3 3 5 5 3\n"," 3 3 3 3 3 3 3 3 5 3 3 3 3 3 0 3 3 3 3 3 3 3 3 3 5 3 3 3 3 3 3 3 3 3 3 3 3\n"," 3 5 3 3 3 3 3 3 3 3 3 3 5 3 3 3 3 3 3 3 3 3 3 5 5 3 3 0 3 3 3 3 5 3 3 3 3\n"," 3 3 3 3 3 3 3 0 3 3 3 3 0 3 3 3 5 5 3 3 3 3 3 3 3 3 3 3 3 3 5 5 3 0 3 3 3\n"," 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 5 5 3 3 3 3 0 3 5 3 5 5 3 3 3 3 3 3 3 5\n"," 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 5 3 3 3 3 0 3 3 3 3 5 5 5 5 3 3 3 5 5 3 5\n"," 3 3 3 3 3 3 5 5 3 3 5 5 3 3 3 3 0 3 3 3 3 5 5 3 3 3 3 3 3 3 3 3 3 5 5 3 3\n"," 3 5 3 0 3 3 3 3 3 3 3 3 3 3 3 3 3 5 3 3 3 0 3 3 3 3 3 3 3 0 3 5 5 3 3 3 3\n"," 3 3 3 0 3 3 3 3 3 3 3 3 3 5 3 3 3 3 3 3 3 3 3 3 3 3 3 5 3 3 3 3 3 3 3 3 5\n"," 5 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 5 3 3 3 3 3 3 3 3 3 3 5 3 3 3 3 3 5 3\n"," 7 4 4 4 4 4 4 4 4 4 4 4 4 4]\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"Q90_O7JVUi96","colab_type":"text"},"source":["### Setting-up mask word embedding"]},{"cell_type":"code","metadata":{"id":"E03rTe7kUi96","colab_type":"code","colab":{}},"source":["# Attention mask for fine-tuning and prediction: 1 for real tokens, 0 for padding\n","attention_masks = [[int(i>0) for i in ii] for ii in input_ids]\n","attention_masks[0];"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"mg7m0PyoUi99","colab_type":"text"},"source":["### Setting-up segment embedding (Analysis - for a sequence tagging task, this embedding is not necessary)"]},{"cell_type":"code","metadata":{"id":"M2mMlimtUi99","colab_type":"code","colab":{}},"source":["# Each input is a single sequence, so every segment id is 0\n","segment_ids = [[0] * len(input_id) for input_id in input_ids]\n","segment_ids[0];"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"s1fH7smpci0t","colab_type":"code","colab":{}},"source":["# 
print(segment_ids) # ERROR - IOPub data rate exceeded. (TOO MUCH!)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"HXfDgtzuUi-F","colab_type":"text"},"source":["## Load TRAIN and VALIDATION sets"]},{"cell_type":"markdown","metadata":{"id":"z37-OiuHUi-I","colab_type":"text"},"source":["**Split all data**"]},{"cell_type":"code","metadata":{"id":"ArF7Fq8fUi-J","colab_type":"code","colab":{}},"source":["tr_inputs, tr_tags, tr_masks, tr_segs = input_ids, tags, attention_masks, segment_ids"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"lYcZ4p6lUi-O","colab_type":"code","outputId":"3370bb41-9c8f-42e1-88d7-28a3ee73e934","executionInfo":{"status":"ok","timestamp":1588622991756,"user_tz":420,"elapsed":305801,"user":{"displayName":"Soujanya Ranganatha Bhat","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjYZYrqnugKHbgoo144GZ9rzmvTfTGIL9eFkBCz=s64","userId":"15617339232293464832"}},"colab":{"base_uri":"https://localhost:8080/","height":34}},"source":["len(tr_inputs),len(tr_segs)"],"execution_count":41,"outputs":[{"output_type":"execute_result","data":{"text/plain":["(27243, 27243)"]},"metadata":{"tags":[]},"execution_count":41}]},{"cell_type":"code","metadata":{"id":"v38ZLJ0-tKDp","colab_type":"code","outputId":"67bbe4cc-d7fe-442b-b4c8-41a08f871f2e","executionInfo":{"status":"ok","timestamp":1588622991757,"user_tz":420,"elapsed":305788,"user":{"displayName":"Soujanya Ranganatha Bhat","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjYZYrqnugKHbgoo144GZ9rzmvTfTGIL9eFkBCz=s64","userId":"15617339232293464832"}},"colab":{"base_uri":"https://localhost:8080/","height":136}},"source":["print(tr_inputs)"],"execution_count":42,"outputs":[{"output_type":"stream","text":["[[ 101 7187 1103 ...    0    0    0]\n"," [ 101 7187 1103 ...    0    0    0]\n"," [ 101 7187 1103 ...    0    0    0]\n"," ...\n"," [ 101 2627 1110 ... 3577 1290 1697]\n"," [ 101 2627 1110 ... 1456 3577 1290]\n"," [ 101 2627 1110 ... 
1206 1103 1244]]\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"wiMXgofyUi-T","colab_type":"text"},"source":["**Convert data to tensors**"]},{"cell_type":"markdown","metadata":{"id":"JXbuWgNaUi-T","colab_type":"text"},"source":["NOTE - Calling tensor.to(device) at this stage is not recommended, since moving the whole dataset to the GPU can exhaust GPU memory"]},{"cell_type":"code","metadata":{"id":"_LEsSzo2Ui-U","colab_type":"code","colab":{}},"source":["tr_inputs = torch.tensor(tr_inputs)\n","tr_tags = torch.tensor(tr_tags)\n","tr_masks = torch.tensor(tr_masks)\n","tr_segs = torch.tensor(tr_segs)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"7nR_mX0aUi-a","colab_type":"text"},"source":["**Put data into data loader**"]},{"cell_type":"code","metadata":{"id":"-iD8Yay_Ui-b","colab_type":"code","colab":{}},"source":["# Set batch size\n","batch_num = 16"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"uFWiVf50Ui-e","colab_type":"code","colab":{}},"source":["# Use only token ids, attention masks and tags; segment ids are not passed\n","train_data = TensorDataset(tr_inputs, tr_masks, tr_tags)\n","train_sampler = RandomSampler(train_data)\n","# drop_last=True discards the final incomplete batch so every batch has the same size\n","train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_num,drop_last=True)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"C4JdNTXXUi-i","colab_type":"text"},"source":["## Train model"]},{"cell_type":"markdown","metadata":{"id":"5bIB2t48Ui-i","colab_type":"text"},"source":["- Prerequisite: download the model files into GDrive\n","- Model used - BERT-base-cased\n","- pytorch_model.bin: [pytorch_model.bin](https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-pytorch_model.bin)\n","- config.json: [config.json](https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-config.json)    
"]},{"cell_type":"markdown","metadata":{"id":"tqQuPl0yUi-j","colab_type":"text"},"source":["**Loading BERT model**"]},{"cell_type":"code","metadata":{"id":"CpnX05xiUi-j","colab_type":"code","colab":{}},"source":["# In this folder, contain model confg(json) and model weight(bin) files\n","# pytorch_model.bin, download from: https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-pytorch_model.bin\n","# config.json, downlaod from: https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-config.json\n","model_file_address = PARENT_DIR + \"/models\""],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"Gtov4DiFfinI","colab_type":"code","outputId":"6014e42d-8dba-44db-8454-93b224d719ba","executionInfo":{"status":"ok","timestamp":1588622995072,"user_tz":420,"elapsed":309063,"user":{"displayName":"Soujanya Ranganatha Bhat","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjYZYrqnugKHbgoo144GZ9rzmvTfTGIL9eFkBCz=s64","userId":"15617339232293464832"}},"colab":{"base_uri":"https://localhost:8080/","height":34}},"source":["!ls \"/content/gdrive/My Drive/MultiRC_NER/models\""],"execution_count":47,"outputs":[{"output_type":"stream","text":["config.json  pytorch_model.bin\tvocab.txt\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"id":"Jr7114QmUi-k","colab_type":"code","colab":{}},"source":["# Will load config and weight with from_pretrained()\n","model = BertForTokenClassification.from_pretrained(model_file_address,num_labels=len(tag2idx))"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"Z55qnZFjUi-m","colab_type":"code","colab":{}},"source":["model;"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"nX7OqEG4Ui-o","colab_type":"code","colab":{}},"source":["# Set model to GPU,if you are using GPU machine\n","model.cuda();"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"rXgBBGaYUi-p","colab_type":"code","colab":{}},"source":["# Add multi GPU 
support\n","#if n_gpu >1:\n"," #   model = torch.nn.DataParallel(model)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"ToxEzMs9Ui-r","colab_type":"code","colab":{}},"source":["# Set the number of epochs and the max gradient norm for clipping\n","epochs = 5\n","max_grad_norm = 1.0"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"vDeMuG_qUi-t","colab_type":"code","colab":{}},"source":["# Calculate the total number of optimization steps\n","num_train_optimization_steps = int( math.ceil(len(tr_inputs) / batch_num) / 1) * epochs"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"FJ1liMnDUi-z","colab_type":"text"},"source":["### Setting-up fine tuning method"]},{"cell_type":"markdown","metadata":{"id":"43hTJyHJUi-z","colab_type":"text"},"source":["**Manual optimizer**"]},{"cell_type":"code","metadata":{"id":"OvuJtdG7Ui-z","colab_type":"code","colab":{}},"source":["# True: fine-tune all the layers\n","# False: fine-tune only the classifier layer\n","FULL_FINETUNING = True"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"Iv1pdD9YUi-2","colab_type":"code","colab":{}},"source":["if FULL_FINETUNING:\n","    # Fine-tune all model parameters\n","    param_optimizer = list(model.named_parameters())\n","    # No weight decay for biases and LayerNorm parameters\n","    no_decay = ['bias', 'gamma', 'beta', 'LayerNorm.weight']\n","    optimizer_grouped_parameters = [\n","        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],\n","         'weight_decay': 0.01},\n","        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],\n","         'weight_decay': 0.0}\n","    ]\n","else:\n","    # Fine-tune only the classifier parameters\n","    param_optimizer = list(model.classifier.named_parameters()) \n","    optimizer_grouped_parameters = [{\"params\": [p for n, p in param_optimizer]}]\n","optimizer = AdamW(optimizer_grouped_parameters, 
lr=3e-5)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"7ZybMSrOUi-8","colab_type":"text"},"source":["### Fine-tuning model"]},{"cell_type":"code","metadata":{"id":"Lz3zEEi_Ui-8","colab_type":"code","colab":{}},"source":["# TRAIN loop\n","model.train();"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"mS_C97Sou9KF","colab_type":"code","colab":{}},"source":["# Check logs for crash\n","#!cat /var/log/colab-jupyter.log"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"scrolled":false,"id":"DZ95271QUi--","colab_type":"code","outputId":"9c3e3002-de08-47c5-a4ac-a3f301042472","executionInfo":{"status":"ok","timestamp":1588583691688,"user_tz":420,"elapsed":3264559,"user":{"displayName":"Soujanya Ranganatha Bhat","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjYZYrqnugKHbgoo144GZ9rzmvTfTGIL9eFkBCz=s64","userId":"15617339232293464832"}},"colab":{"base_uri":"https://localhost:8080/","height":204}},"source":["print(\"***** Running training *****\")\n","print(\"  Num examples = %d\"%(len(tr_inputs)))\n","print(\"  Batch size = %d\"%(batch_num))\n","print(\"  Num steps = %d\"%(num_train_optimization_steps))\n","for _ in trange(epochs,desc=\"Epoch\"):\n","    tr_loss = 0\n","    nb_tr_examples, nb_tr_steps = 0, 0\n","    for step, batch in enumerate(train_dataloader):\n","        # add batch to gpu\n","        batch = tuple(t.to(device) for t in batch)\n","        b_input_ids, b_input_mask, b_labels = batch\n","        \n","        # forward pass\n","        outputs = model(b_input_ids, token_type_ids=None,\n","        attention_mask=b_input_mask, labels=b_labels)\n","        loss, scores = outputs[:2]\n","      #  if n_gpu>1:\n","            # When multi gpu, average it\n","       #     loss = loss.mean()\n","        \n","        # backward pass\n","        loss.backward()\n","        \n","        # track train loss\n","        tr_loss += loss.item()\n","        nb_tr_examples += 
b_input_ids.size(0)\n","        nb_tr_steps += 1\n","        \n","        # gradient clipping\n","        torch.nn.utils.clip_grad_norm_(parameters=model.parameters(), max_norm=max_grad_norm)\n","        \n","        # update parameters\n","        optimizer.step()\n","        optimizer.zero_grad()\n","        \n","    # print train loss per epoch\n","    print(\"Train loss: {}\".format(tr_loss/nb_tr_steps))\n","        "],"execution_count":0,"outputs":[{"output_type":"stream","text":["\rEpoch:   0%|          | 0/5 [00:00<?, ?it/s]"],"name":"stderr"},{"output_type":"stream","text":["***** Running training *****\n","  Num examples = 27243\n","  Batch size = 16\n","  Num steps = 8515\n"],"name":"stdout"},{"output_type":"stream","text":["/pytorch/torch/csrc/utils/python_arg_parser.cpp:756: UserWarning: This overload of add_ is deprecated:\n","\tadd_(Number alpha, Tensor other)\n","Consider using one of the following signatures instead:\n","\tadd_(Tensor other, *, Number alpha)\n","Epoch:  20%|██        | 1/5 [58:24<3:53:38, 3504.69s/it]"],"name":"stderr"},{"output_type":"stream","text":["Train loss: 0.02005093345981087\n"],"name":"stdout"},{"output_type":"stream","text":["\rEpoch:  40%|████      | 2/5 [1:56:58<2:55:22, 3507.42s/it]"],"name":"stderr"},{"output_type":"stream","text":["Train loss: 0.0026472462832295147\n"],"name":"stdout"},{"output_type":"stream","text":["\rEpoch:  60%|██████    | 3/5 [2:55:20<1:56:51, 3505.66s/it]"],"name":"stderr"},{"output_type":"stream","text":["Train loss: 0.002330057923762745\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"CHPIwbuGUi_C","colab_type":"text"},"source":["## Save model "]},{"cell_type":"code","metadata":{"id":"YsUIIUCSUi_C","colab_type":"code","colab":{}},"source":["# TODO: output/ => original data, output/sample/ => sampled data\n","bert_out_address = PARENT_DIR + 
\"/output/trained_v5\""],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"FpSNSkWtUi_E","colab_type":"code","colab":{}},"source":["# Make dir if not exits\n","if not os.path.exists(bert_out_address):\n","        os.makedirs(bert_out_address)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"DdOW-B46Ui_I","colab_type":"code","colab":{}},"source":["# Save a trained model, configuration and tokenizer\n","model_to_save = model.module if hasattr(model, 'module') else model  # Only save the model it-self"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"P2uAveVXUi_L","colab_type":"code","colab":{}},"source":["# If we save using the predefined names, we can load using `from_pretrained`\n","output_model_file = os.path.join(bert_out_address, \"pytorch_model.bin\")\n","output_config_file = os.path.join(bert_out_address, \"config.json\")"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"scrolled":false,"id":"GWx_l9fkUi_N","colab_type":"code","colab":{}},"source":["# Save model into file\n","torch.save(model_to_save.state_dict(), output_model_file)\n","model_to_save.config.to_json_file(output_config_file)\n","tokenizer.save_vocabulary(bert_out_address)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"kdOCouaEzzNb","colab_type":"text"},"source":["# ----------- END OF TRAINING -----------"]},{"cell_type":"markdown","metadata":{"id":"lZpyxO9_zjCS","colab_type":"text"},"source":["# Refer to MultiRC-NER_eval note book for EVALUATIONS & ANALYSIS"]}]}